Multiple-source cross-validation

Authors

  • Krzysztof Geras
  • Charles A. Sutton
Abstract

Cross-validation is an essential tool in machine learning and statistics. The typical procedure, in which data points are randomly assigned to one of the test sets, makes an implicit assumption that the data are exchangeable. A common case in which this does not hold is when the data come from multiple sources, in the sense used in transfer learning. In this case it is common to arrange the cross-validation procedure in a way that takes the source structure into account. Although common in practice, this procedure does not appear to have been theoretically analysed. We present new estimators of the variance of the cross-validation, both in the multiple-source setting and in the standard iid setting. These new estimators allow for much more accurate confidence intervals and hypothesis tests to compare algorithms.
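
To make the source-aware arrangement concrete, here is a minimal sketch of one common way to build such folds: each source is held out in turn as the test set while the remaining sources form the training set. This illustrates only the fold construction mentioned in the abstract, not the variance estimators the paper proposes; the function name `leave_one_source_out` and the toy source labels are assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_source_out(sources):
    """Yield (held_out_source, train_indices, test_indices).

    `sources` holds one source label per data point. Each distinct
    source serves as the test set exactly once (hypothetical helper,
    not taken from the paper).
    """
    # Group data-point indices by the source they came from.
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)

    # Hold out one source at a time; train on all the others.
    ordered = sorted(by_source)
    for held_out in ordered:
        test_idx = by_source[held_out]
        train_idx = [i for s in ordered if s != held_out for i in by_source[s]]
        yield held_out, train_idx, test_idx


# Toy example: six data points drawn from three sources.
sources = ["a", "a", "b", "c", "c", "c"]
splits = list(leave_one_source_out(sources))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Unlike random assignment, no source contributes points to both the training and the test side of a split, so the test error reflects performance on an unseen source.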


Similar resources

Statistical Confidence for Variable Selection in QSAR Models via Monte Carlo Cross-Validation

A new variable selection wrapper method named the Monte Carlo variable selection (MCVS) method was developed utilizing the framework of the Monte Carlo cross-validation (MCCV) approach. The MCVS method reports the variable selection results in the most conventional and common measure of statistical hypothesis testing, the P-values, thus allowing for a clear and simple statistical interpretation...

Machine Learning Based Drug Indication Prediction Using Linked Open Data

In this study, drug and disease features were obtained by querying open linked data to train our classifier for predicting new drug indications, and the predictive performance of the classifier for different validation schemes was evaluated. We collected the drug and disease data from Bio2RDF, an open source project that uses semantic web technologies to link data from multiple sources. A binar...

SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates

MOTIVATION: Single nucleotide polymorphism (SNP) detection exploiting redundancy in expressed sequence tag (EST) collections that arises from the presence of transcripts of the same gene from different individuals has been used to generate large collections of SNPs for many species. A second source of redundancy, namely that EST collections can contain multiple transcripts of the same gene from ...

Comparative Study on Applicability of Four Software Size Estimation Models Based on Lines of Code

Early estimation of project size and completion time is essential for successful project planning and tracking. Multiple methods have been proposed to estimate software size and cost parameters. Suitability of the estimation methods depends on many factors like software application domain, product complexity, availability of historical data, team expertise etc. We present an empirical validatio...

QSRR Study of Organic Dyes by Multiple Linear Regression Method Based on Genetic Algorithm (GA–MLR)

Quantitative structure-retention relationships (QSRRs) are used to correlate paper chromatographic retention factors of disperse dyes with theoretical molecular descriptors. A data set of 23 compounds with known RF values was used. The genetic algorithm-multiple linear regression analysis (GA-MLR) with three selected theoretical descriptors was obtained. The stability and predictability of the ...

Publication date: 2013